Abstract

We show that a natural online algorithm for scheduling jobs on a heterogeneous multiprocessor,
with arbitrary power functions, is scalable for the objective function of weighted flow plus energy.

1 Introduction

Many prominent computer architects believe that architectures consisting of heterogeneous
processors/cores, such as the STI Cell processor, will be the dominant architectural design in the
future [8, 13, 12, 17, 18]. The main advantage of a heterogeneous architecture, relative to an
architecture of identical processors, is that it allows for the inclusion of processors whose design is
specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for
that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a
small number of high-power high-performance processors for critical jobs, and a larger number of
lower-power lower-performance processors for less critical jobs. Naturally, the lower-power
processors would be more energy efficient in terms of the computation performed per unit of energy
expended, and would generate less heat per unit of computation. For a given area and power
budget, heterogeneous designs can give significantly better performance for standard workloads [5, 17]:
emulations in [12] suggest a figure of 40% better performance, and emulations in [18] suggest a
figure of 67% better performance. Moreover, even processors that were designed to be homogeneous
are increasingly likely to be heterogeneous at run time [8]: the dominant underlying cause is the
increasing variability in the fabrication process as the feature size is scaled down (although run
time faults will also play a role). Since manufacturing yields would be unacceptably low if every
processor/core was required to be perfect, and since there would be significant performance loss
from derating the entire chip to the functioning of the least functional processor (which is what
would be required in order to attain processor homogeneity), some processor heterogeneity seems
inevitable in chips with many processors/cores.

The position paper [8] identifies three fundamental challenges in scheduling heterogeneous
multiprocessors: (1) the OS must discover the status of each processor, (2) the OS must discover the
resource demand of each job, and (3) given this information about processors and jobs, the OS
must match jobs to processors as well as possible. In this paper, we address this third fundamental
challenge. In particular, we assume that different jobs are of differing importance, and we study
how to assign these jobs to processors of varying power and varying energy efficiency, so as to
achieve the best possible trade-off between energy and performance.

Formally, we assume that a collection of jobs arrive in an online fashion over time. When a job $j$
arrives in the system, the system is able to discover a size $p_j \in \mathbb{R}_{>0}$, as well as an
importance/weight $w_j \in \mathbb{R}_{>0}$, for that job. The importance $w_j$ specifies an upper bound
on the amount of energy that the system is allowed to invest in running $j$ to reduce $j$'s flow by one
unit of time (assuming that this energy investment in $j$ doesn't decrease the flow of other jobs);
hence jobs with high weight are more important, since higher investments of energy are permissible
to justify a fixed reduction in flow. Furthermore, we assume that the system knows the allowable
speeds for each processor, and the system also knows the power used when each processor is run at
its set of allowable speeds. We make no real restrictions on the allowable speeds, or on the power
used for these speeds.^1 The online scheduler has three component policies:

Job Selection: Determines which job to run on each processor at any time. 
Speed Scaling: Determines the speed of each processor at each time. 

Assignment: When a new job arrives, it determines the processor to which this new job is assigned. 

The objective we consider is that of weighted flow plus energy. The rationale for this objective 
function is that the optimal schedule under this objective gives the best possible weighted flow for 
the energy invested, and increasing the energy investment will not lead to a corresponding reduction 
in weighted flow (intuitively, it is not possible to speed up a collection of jobs with an investment 
of energy proportional to these jobs' importance). 

We consider the following natural online algorithm that essentially adopts the job selection and 
speed scaling algorithms from the uniprocessor algorithm in [5], and then greedily assigns the jobs 
based on these policies. 

Job Selection: Highest Density First (HDF) 

Speed Scaling: The speed is set so that the power is the fractional weight of the unfinished jobs. 

Assignment: A new job is assigned to the processor that results in the least increase in the
projected future weighted flow, assuming the adopted speed scaling and job selection policies,
and ignoring the possibility of jobs arriving in the future.

Our main result is then: 

Theorem 1.1 This online algorithm is scalable for scheduling jobs on a heterogeneous multiproces- 
sor with arbitrary power functions to minimize the objective function of weighted flow plus energy. 

In this context, scalable means that if the adversary can run processor $i$ at speed $s$ and power
$P(s)$, the online algorithm is allowed to run the processor at speed $(1+\epsilon)s$ and power $P(s)$,
and then for all inputs, the online cost is bounded by $O(f(\epsilon))$ times the optimal cost.
Intuitively, a scalable algorithm can handle almost the same load as optimal; for further
elaboration, see [20, 19]. Theorem 1.1 extends theorems showing similar results for weighted flow
plus energy on a uniprocessor [5, 2], and for weighted flow on a multiprocessor without power
considerations [9]. As scheduling on identical processors with the objective of total flow, and
scheduling on a uniprocessor with the objective of weighted flow, are special cases of our problem,
constant competitiveness is not possible without some sort of resource augmentation [16, 3].

^1 So the processors may or may not be speed scalable, the speeds may be continuous or discrete or a
mixture, the static power may or may not be negligible, the dynamic power may or may not satisfy the
cube root rule, etc.

Our analysis is an amortized local-competitiveness argument. As is usually the case with such 
arguments, the main technical hurdle is to discover the "right" potential function. The most natural 
straw-man potential function to try is the sum over all processors of the single processor potential 
function used in [5]. While one can prove constant competitiveness with this potential in some
special cases (e.g. where for each processor the allowable speeds are the non-negative reals, and
the power satisfies the cube-root rule), one cannot prove constant competitiveness for general
power functions with this potential function. The reason for this is that the uniprocessor potential
function from [5] is not sufficiently accurate. Specifically, one can construct configurations where
the adversary has finished all jobs, and where the potential is much higher than the remaining
online cost. This did not invalidate the analysis in [5] because, to finish all these jobs by this time, the
adversary would have had to run very fast in the past, wasting a lot of energy, which could then
be used to pay for this unnecessarily high potential. But since we consider multiple processors, the
adversary may have no jobs left on a particular processor simply because it assigned these jobs to a 
different processor, and there may not be a corresponding unnecessarily high adversarial cost that 
can be used to pay for this unnecessarily high potential. 

Thus, the main technical contribution in this paper is a seemingly more accurate potential 
function expressing the additional cost required to finish one collection of jobs compared to another 
collection of jobs. Our potential function is arguably more transparent than the one used in [5], 
and we expect that this potential function will find future application in the analysis of other power 
management algorithms. 

In Section 3 we show that a similar online algorithm is $O(1/\epsilon)$-competitive with
$(1+\epsilon)$-speedup for unweighted flow plus energy. We also remark that when the power functions
$P_i(s)$ are restricted to be of the form $s^{\alpha_i}$, our algorithms give an $O(\alpha^2)$-competitive
algorithm (with no resource augmentation needed) for the problem of minimizing weighted flow plus
energy, and an $O(\alpha)$-competitive algorithm for minimizing the unweighted flow plus energy, where
$\alpha = \max_i \alpha_i$.

1.1 Related Results 

Let us first consider previous work for the case of a single processor, with unbounded speed, and 
a polynomially bounded power function $P(s) = s^{\alpha}$. [21] gave an efficient offline algorithm to find
the schedule that minimizes average flow subject to a constraint on the amount of energy used, 
in the case that jobs have unit work. [1] introduced the objective of flow plus energy and gave a 
constant competitive algorithm for this objective in the case of unit work jobs. [6] gave a constant 
competitive algorithm for the objective of weighted flow plus energy. The competitive ratio was 
improved by [15] for the unweighted case using a potential function specifically tailored to integer 
flow. [1] extended the results of [6] to the bounded speed model, and [10] gave a nonclairvoyant
algorithm that is $O(1)$-competitive.

Still for a single processor, dropping the assumptions of unbounded speed and polynomially- 
bounded power functions, [5] gave a 3-competitive algorithm for the objective of unweighted flow 
plus energy, and a 2-competitive algorithm for fractional weighted flow plus energy, both in the 
uniprocessor case for a large class of power functions. The former analysis was subsequently im- 
proved by [2] to show 2-competitiveness, along with a matching lower bound. 

Now for multiple processors: [14] considered the setting of multiple homogeneous processors,
where the allowable speeds range between zero and some upper bound, and the power function 
is polynomial in this range. They gave an algorithm that uses a variant of round-robin for the 
assignment policy, and job selection and speed scaling policies from [6], and showed that this 
algorithm is scalable for the objective of (unweighted) flow plus energy. Subsequently, [11] showed 
that a randomized machine selection algorithm is scalable for weighted flow plus energy (and 
even more general objective functions) in the setting of polynomial power functions. Both these 
algorithms provide non-migratory schedules and compare their costs with optimal solutions which 
could even be migratory. In comparison, as mentioned above, for the case of polynomial power 
functions, our techniques can give a deterministic constant-competitive online algorithm for non- 
migratory weighted flow time plus energy. (Details appear in the final version.) 

In non-power-aware settings, the paper most relevant to this work is that of [9], which gives a 
scalable online algorithm for minimizing weighted flow on unrelated processors. Their setting is even 
more demanding, since they allow the processing requirement of the job to be processor dependent 
(which captures a type of heterogeneity that is orthogonal to the performance energy-efficiency 
heterogeneity that we consider in this paper). Our algorithm is based on the same general intuition
as theirs: they assign each new job to the processor that would result in the least increment in 
future weighted flow (assuming HDF is used for job selection), and show that this online algorithm is 
scalable using an amortized local competitiveness argument. However, it is unclear how to directly 
extend their potential function to our power-aware setting; we had success only in the case that 
each processor had allowable speed-power combinations lying in $\{(0,0), (s_i, P_i)\}$.

1.2 Preliminaries 

1.2.1 Scheduling Basics. 

We consider only non-migratory schedules, which means that no job can ever run on one processor, 
and later run on some other processor. In general, migration is undesirable as the overhead can 
be significant. We assume that preemption is allowed, that is, that jobs may be suspended, and 
restarted later from the point of suspension. It is clear that if preemption is not allowed, bounded 
competitiveness is not obtainable. The speed is the rate at which work is completed; a job $j$ with
size $p_j$ run at a constant speed $s$ completes in $p_j/s$ seconds. A job is completed when all of its
work has been processed. The flow of a job is the completion time of the job minus the release time of
the job. The weighted flow of a job is the weight of the job times the flow of the job. For a time
$t \ge r_j$, where $r_j$ is the release time of job $j$, let $p_j(t)$ be the remaining unprocessed work on
job $j$ at time $t$. The fractional weight of job $j$ at this time is $w_j \, p_j(t)/p_j$. The fractional
weighted flow of a job is the integral over times between the job's release time and its completion
time of its fractional weight at that time. The density of a job is its weight divided by its size.
The job selection policy Highest Density First (HDF) always runs the job of highest density. The
inverse density of a job is its size divided by its weight.
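
The definitions above can be made concrete with a small sketch. The `Job` record and the helper `hdf_pick` below are illustrative names that do not appear in the paper; the code only restates the definitions of density, fractional weight, and HDF selection for jobs on a single processor.

```python
from dataclasses import dataclass


@dataclass
class Job:
    release: float    # release time r_j
    size: float       # total work p_j
    weight: float     # importance w_j
    remaining: float  # unprocessed work p_j(t) at the current time

    def density(self) -> float:
        # density = weight / size
        return self.weight / self.size

    def fractional_weight(self) -> float:
        # fractional weight = w_j * p_j(t) / p_j
        return self.weight * self.remaining / self.size


def hdf_pick(jobs):
    """Highest Density First: return the unfinished job of highest density,
    or None when every job is finished."""
    alive = [j for j in jobs if j.remaining > 0]
    return max(alive, key=Job.density, default=None)
```

For instance, a half-finished job of size 1 and weight 1 has density 1 and fractional weight 0.5, and HDF prefers it over an untouched job of size 4 and weight 2 (density 0.5).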

1.2.2 Power Functions. 

The power function for processor $i$ is denoted by $P_i(s)$, and specifies the power used when
processor $i$ is run at speed $s$. We essentially allow any reasonable power function. However, we do
require the following minimal conditions on each power function, which we adopt from [5]. We assume
that the allowable speeds are a countable collection of disjoint subintervals of $[0, \infty)$. We
assume that all the intervals, except possibly the rightmost interval, are closed on both ends. The
rightmost interval may be open on the right if the power $P_i(s)$ approaches infinity as the speed $s$
approaches the rightmost endpoint of that interval. We assume that $P_i$ is non-negative, and $P_i$ is
continuous and differentiable on all but countably many points. We assume that either there is a
maximum allowable speed $T$, or that the limit inferior of $P_i(s)/s$ as $s$ approaches infinity is not
zero (if this condition does not hold, then the optimal speed scaling policy is to run at infinite
speed). Using transformations specified in [5], we may assume without loss of generality that the
power functions satisfy the following properties: $P$ is continuous and differentiable, $P(0) = 0$, $P$
is strictly increasing, $P$ is strictly convex, and $P$ is unbounded. We use $Q_i$ to denote $P_i^{-1}$;
i.e., $Q_i(y)$ gives us the speed that we can run processor $i$ at, if we specify a power limit of $y$.
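
Since the analysis only ever uses the power function through its inverse, it may help to see how that inverse could be evaluated when no closed form is available. The following is a minimal sketch, assuming the normalized properties stated above (zero power at speed zero, strictly increasing, unbounded); the function name `inverse_power` and its parameters are hypothetical and not part of the paper.

```python
def inverse_power(P, y, s_hi=1.0, tol=1e-12):
    """Return a speed s >= 0 with P(s) approximately equal to y.

    Assumes P(0) = 0 and that P is continuous, strictly increasing and
    unbounded, so the inverse exists for every y >= 0.
    """
    # Grow the upper end of the bracket until it covers the target power.
    while P(s_hi) < y:
        s_hi *= 2.0
    s_lo = 0.0
    # Plain bisection on the bracket [s_lo, s_hi].
    while s_hi - s_lo > tol * max(1.0, s_hi):
        mid = 0.5 * (s_lo + s_hi)
        if P(mid) < y:
            s_lo = mid
        else:
            s_hi = mid
    return 0.5 * (s_lo + s_hi)


# Example: the cube-root rule corresponds to P(s) = s**3.
speed = inverse_power(lambda s: s ** 3, 8.0)
```

With the cubic power function in the example, a power budget of 8 yields a speed close to 2. Under the speed scaling policy described in the introduction, the budget passed in would be the fractional weight of the unfinished jobs on that processor.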

1.2.3 Local Competitiveness and Potential Functions. 

Finally, let us quickly review amortized local competitiveness analysis on a single processor.
Consider an objective $G$. Let $G_A(t)$ be the increase in the objective in the schedule for algorithm
$A$ at time $t$. So when $G$ is fractional weighted flow plus energy, $G_A(t)$ is $P_a^t + w_a^t$, where
$P_a^t$ is the power for $A$ at time $t$ and $w_a^t$ is the fractional weight of the unfinished jobs for
$A$ at time $t$. Let OPT be the offline adversary that optimizes $G$. $A$ is locally $c$-competitive if
for all times $t$, $G_A(t) \le c \cdot G_{OPT}(t)$. To prove $A$ is $(c+d)$-competitive using an amortized
local competitiveness argument, it suffices to give a potential function $\Phi(t)$ such that the
following conditions hold (see for example [19]).

Boundary condition: $\Phi$ is zero before any job is released and is non-negative after all jobs are
finished.

Completion condition: $\Phi$ does not increase due to completions by either $A$ or OPT.

Arrival condition: $\Phi$ does not increase more than $d \cdot OPT$ due to job arrivals.

Running condition: At any time $t$ when no job arrives or is completed,

$$G_A(t) + \frac{d\Phi(t)}{dt} \le c \cdot G_{OPT}(t) \tag{1}$$

The sufficiency of these conditions for proving $(c+d)$-competitiveness follows from integrating them
over time.
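
The integration step can be spelled out as follows. This is a sketch of the standard argument rather than text from the paper: $T$ denotes a time after all jobs are finished, $\Delta\Phi$ denotes the jump of the potential at an arrival or completion event, and the integral of $d\Phi/dt$ is taken over the times between events.

```latex
\begin{align*}
\int_0^T G_A(t)\,dt
  &\le c \int_0^T G_{OPT}(t)\,dt - \int_0^T \frac{d\Phi}{dt}\,dt
     && \text{(running condition)} \\
  &= c \cdot \mathrm{OPT} - \Bigl(\Phi(T) - \Phi(0) - \sum_{\text{events}} \Delta\Phi\Bigr)
     && \text{(continuous part of } \Phi\text{)} \\
  &\le c \cdot \mathrm{OPT} + \sum_{\text{arrivals}} \Delta\Phi
     && \text{(boundary and completion conditions)} \\
  &\le (c + d) \cdot \mathrm{OPT}
     && \text{(arrival condition)}
\end{align*}
```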

2 Weighted Flow 

Our goal in this section is to prove Theorem 1.1. We first show that the online algorithm is
$(1+\epsilon)$-speed $O(1/\epsilon)$-competitive for the objective of fractional weighted flow plus energy.
Theorem 1.1 then follows since HDF is $(1+\epsilon)$-speed $O(1/\epsilon)$-competitive for fixed processor
speeds [7] for the objective of (integer) weighted flow.

Let OPT be some optimal schedule minimizing fractional weighted flow. Let $w_{a,i}^t(q)$ denote the
total fractional weight of jobs in processor $i$'s queue that have an inverse density of at least $q$.
Let $w_{a,i}^t := w_{a,i}^t(0)$ be the total fractional weight of unfinished jobs in the queue. Let
$w_a^t := \sum_i w_{a,i}^t$ be the total fractional weight of unfinished jobs in all queues. Let
$w_{o,i}^t(q)$, $w_{o,i}^t$ and $w_o^t$ be similarly defined for OPT. When the time instant being
considered is clear, we drop the superscript of $t$ from all variables.

We assume that once OPT has assigned a job to some processor, it runs the BCP algorithm [5]
for job selection and speed scaling, i.e., it sets the speed of the $i$-th processor to
$Q_i(w_{o,i})$, and hence the $i$-th processor uses power $w_{o,i}$, and uses HDF for job selection. We
can make such an assumption because the results of [5] show that the fractional weighted flow plus
energy of the schedule output by this algorithm is within a factor of two of optimal. Therefore, the
only real difference between OPT and the online algorithm is the assignment policy.

2.1 The Assignment Policy 

To better understand the online algorithm's assignment policy, define the "shadow potential" for
processor $i$ at time $t$ to be

$$\widehat{\Phi}_{a,i}(t) = \int_{q=0}^{\infty} \int_{x=0}^{w_{a,i}^t(q)} \frac{x}{Q_i(x)} \, dx \, dq \tag{2}$$

The shadow potential captures (up to a constant factor) the total fractional weighted flow to
serve the current set of jobs if no jobs arrive in the future. Based on this, the online algorithm's
assignment policy can alternatively be described as follows:

Assignment Policy. When a new job with size $p_j$ and weight $w_j$ arrives at time $t$, the assignment
policy assigns it to a processor which would cause the smallest increase in the shadow potential;
i.e., writing $d_j = p_j/w_j$ for the inverse density of the new job, a processor minimizing

$$\int_{q=0}^{d_j} \int_{x=0}^{w_{a,i}(q)+w_j} \frac{x}{Q_i(x)} \, dx \, dq
  - \int_{q=0}^{d_j} \int_{x=0}^{w_{a,i}(q)} \frac{x}{Q_i(x)} \, dx \, dq
  = \int_{q=0}^{d_j} \int_{x=w_{a,i}(q)}^{w_{a,i}(q)+w_j} \frac{x}{Q_i(x)} \, dx \, dq$$
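
The greedy rule can be evaluated exactly when the inner integral has a closed form. The sketch below is illustrative only: it assumes polynomial power functions $P_i(s) = s^{\alpha_i}$ (so that $x/Q_i(x) = x^{1-1/\alpha_i}$), and the names `Job`, `shadow_increase` and `assign` are hypothetical. It uses the fact that the queued fractional weight with inverse density at least $q$ is a step function of $q$, so the outer integral reduces to a finite sum.

```python
from dataclasses import dataclass


@dataclass
class Job:
    size: float       # original size p_j
    weight: float     # weight w_j
    remaining: float  # remaining work p_j(t)

    @property
    def inverse_density(self) -> float:
        return self.size / self.weight

    @property
    def fractional_weight(self) -> float:
        return self.weight * self.remaining / self.size


def inner_integral(lo, hi, alpha):
    # Integral of x / Q(x) over [lo, hi], assuming P(s) = s**alpha,
    # hence Q(x) = x**(1/alpha) and x / Q(x) = x**(1 - 1/alpha).
    e = 2.0 - 1.0 / alpha
    return (hi ** e - lo ** e) / e


def shadow_increase(queue, new_job, alpha):
    """Increase in the shadow potential if new_job joins this queue."""
    d = new_job.inverse_density
    # w(q) only changes at the inverse densities of queued jobs.
    cuts = sorted({0.0, d} | {j.inverse_density for j in queue
                              if j.inverse_density < d})
    total = 0.0
    for left, right in zip(cuts, cuts[1:]):
        mid = 0.5 * (left + right)
        w = sum(j.fractional_weight for j in queue if j.inverse_density >= mid)
        total += (right - left) * inner_integral(w, w + new_job.weight, alpha)
    return total


def assign(queues, alphas, new_job):
    """Index of the processor with the smallest shadow-potential increase."""
    costs = [shadow_increase(q, new_job, a) for q, a in zip(queues, alphas)]
    return min(range(len(costs)), key=costs.__getitem__)
```

With two processors of equal exponent, an empty queue is preferred over a loaded one; with unequal exponents the choice also depends on how cheaply each processor converts power into speed.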

2.2 Amortized Local Competitiveness Analysis 

We apply a local competitiveness argument as described in Subsection 1.2.3. Because the online
algorithm is using the BCP algorithm on each processor, the power for the online algorithm is
$\sum_i P_i(Q_i(w_{a,i})) = w_a$. Thus $G_A = 2w_a$. Similarly, since OPT is using BCP on each
processor, $G_{OPT} = 2w_o$.

2.2.1 Defining the potential function 

For processor $i$, define the potential

$$\Phi_i(t) = \frac{2}{\epsilon} \int_{q=0}^{\infty} \int_{x=0}^{(w_{a,i}(q) - w_{o,i}(q))_+} \frac{x}{Q_i(x)} \, dx \, dq \tag{3}$$

Here $(\cdot)_+ = \max(\cdot, 0)$. The global potential is then defined to be $\Phi(t) = \sum_i \Phi_i(t)$.
Firstly, we observe that the function $x/Q_i(x)$ is increasing and subadditive. Then, the following
lemma will be useful subsequently, the proof of which will appear in the full version of the paper.

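Lemma 2.1 Let $g$ be any increasing subadditive function with $g(0) \ge 0$, and
$w_a, w_o, w_j \in \mathbb{R}_{\ge 0}$. Then,

$$\int_{x=w_a}^{w_a+w_j} g(x) \, dx - \int_{x=(w_a - w_o - w_j)_+}^{(w_a - w_o)_+} g(x) \, dx
  \le 2 \int_{x=0}^{w_j} g(w_o + x) \, dx$$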
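
As a sanity check rather than a proof, the inequality of Lemma 2.1 can be tested numerically for one concrete choice of the function. The sketch below assumes the cubic power function $P(s) = s^3$, for which $g(x) = x/Q(x) = x^{2/3}$ is increasing and subadditive, and reads the lemma as bounding the difference of the integral of $g$ over $[w_a, w_a + w_j]$ and over $[(w_a - w_o - w_j)_+, (w_a - w_o)_+]$ by twice the integral of $g$ over $[w_o, w_o + w_j]$. The names `ALPHA`, `g`, `G`, `lemma_gap` and `worst_gap` are illustrative.

```python
import random

ALPHA = 3.0  # assumed exponent: P(s) = s**ALPHA


def g(x):
    # g(x) = x / Q(x) with Q(x) = x**(1/ALPHA)
    return x ** (1.0 - 1.0 / ALPHA)


def G(x):
    # Antiderivative of g with G(0) = 0.
    e = 2.0 - 1.0 / ALPHA
    return x ** e / e


def lemma_gap(wa, wo, wj):
    """Right-hand side minus left-hand side of the lemma's inequality."""
    pos = lambda v: max(v, 0.0)
    lhs = (G(wa + wj) - G(wa)) - (G(pos(wa - wo)) - G(pos(wa - wo - wj)))
    rhs = 2.0 * (G(wo + wj) - G(wo))
    return rhs - lhs


def worst_gap(trials=10000, seed=0):
    """Smallest gap seen over random non-negative inputs."""
    rng = random.Random(seed)
    return min(
        lemma_gap(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 10))
        for _ in range(trials)
    )
```

A non-negative value of `worst_gap()` (up to floating-point error) is consistent with the lemma for this particular function; it says nothing about other power functions.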

The boundary and completion conditions are clearly satisfied. In Lemma 2.2 we prove
that the arrival condition holds, and in Lemma 2.3 we prove that the running condition holds.

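Lemma 2.2 The arrival condition holds with $d = \frac{4}{\epsilon}$.

Proof: Consider a new job $j$ with processing time $p_j$, weight $w_j$ and inverse density
$d_j = p_j/w_j$, which the algorithm assigns to processor 1 while the optimal solution assigns it to
processor 2. Observe that

$$\int_{q=0}^{d_j} \int_{x=w_{o,2}(q)}^{w_{o,2}(q)+w_j} \frac{x}{Q_2(x)} \, dx \, dq$$

is the increase in OPT's fractional weighted flow due to this new job $j$. Thus our goal is to prove
that the increase in the potential due to job $j$'s arrival is at most $4/\epsilon$ times this amount.
The change in the potential $\Delta\Phi$ is:

$$\Delta\Phi = \frac{2}{\epsilon} \int_{q=0}^{d_j} \left(
  \int_{x=(w_{a,1}(q)-w_{o,1}(q))_+}^{(w_{a,1}(q)-w_{o,1}(q)+w_j)_+} \frac{x}{Q_1(x)} \, dx
  - \int_{x=(w_{a,2}(q)-w_{o,2}(q)-w_j)_+}^{(w_{a,2}(q)-w_{o,2}(q))_+} \frac{x}{Q_2(x)} \, dx
\right) dq$$

Now, since $x/Q_1(x)$ is an increasing function we have that

$$\int_{x=(w_{a,1}(q)-w_{o,1}(q))_+}^{(w_{a,1}(q)-w_{o,1}(q)+w_j)_+} \frac{x}{Q_1(x)} \, dx
  \le \int_{x=w_{a,1}(q)}^{w_{a,1}(q)+w_j} \frac{x}{Q_1(x)} \, dx$$

and hence the change of potential can be bounded by

$$\frac{2}{\epsilon} \int_{q=0}^{d_j} \left(
  \int_{x=w_{a,1}(q)}^{w_{a,1}(q)+w_j} \frac{x}{Q_1(x)} \, dx
  - \int_{x=(w_{a,2}(q)-w_{o,2}(q)-w_j)_+}^{(w_{a,2}(q)-w_{o,2}(q))_+} \frac{x}{Q_2(x)} \, dx
\right) dq$$

Since we assigned the job to processor 1, we know that

$$\int_{q=0}^{d_j} \int_{x=w_{a,1}(q)}^{w_{a,1}(q)+w_j} \frac{x}{Q_1(x)} \, dx \, dq
  \le \int_{q=0}^{d_j} \int_{x=w_{a,2}(q)}^{w_{a,2}(q)+w_j} \frac{x}{Q_2(x)} \, dx \, dq$$

Therefore, the change in potential is at most

$$\Delta\Phi \le \frac{2}{\epsilon} \int_{q=0}^{d_j} \left(
  \int_{x=w_{a,2}(q)}^{w_{a,2}(q)+w_j} \frac{x}{Q_2(x)} \, dx
  - \int_{x=(w_{a,2}(q)-w_{o,2}(q)-w_j)_+}^{(w_{a,2}(q)-w_{o,2}(q))_+} \frac{x}{Q_2(x)} \, dx
\right) dq$$

Applying Lemma 2.1 we get:

$$\Delta\Phi \le \left(2 \cdot \frac{2}{\epsilon}\right) \int_{q=0}^{d_j} \int_{x=w_{o,2}(q)}^{w_{o,2}(q)+w_j} \frac{x}{Q_2(x)} \, dx \, dq$$

□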



Lemma 2.3 The running condition holds with constant $c = 1 + \frac{1}{\epsilon}$.

Proof: Let us consider an infinitesimally small interval $[t, t + dt)$ during which no jobs arrive and
analyze the change in the potential $\Phi(t)$. Since $\Phi(t) = \sum_i \Phi_i(t)$, we can do this on a
per-processor basis. Fix a single processor $i$, and time $t$. Let
$w_i(q) := (w_{a,i}(q) - w_{o,i}(q))_+$, and $w_i := (w_{a,i} - w_{o,i})_+$. Let $q_a$ and $q_o$ denote the
inverse densities of the jobs being executed on processor $i$ by the algorithm and optimal solution
respectively (which are the densest jobs in their respective queues, since both run HDF). Define
$s_a = Q_i(w_{a,i})$ and $s_o = Q_i(w_{o,i})$. Since we assumed that OPT uses the BCP algorithm on each
processor, OPT runs processor $i$ at speed $s_o$. Since the online algorithm is also using BCP, but has
$(1+\epsilon)$-speed augmentation, the online algorithm runs the processor at speed $(1+\epsilon)s_a$.
Hence the fractional weight of the job the online algorithm works on decreases at a rate of
$s_a(1+\epsilon)/q_a$. Therefore, the quantity $w_{a,i}(q)$ drops by $s_a \, dt \, (1+\epsilon)/q_a$ for
$q \in [0, q_a]$. Likewise, $w_{o,i}(q)$ drops by $s_o \, dt/q_o$ for $q \in [0, q_o]$ due to the optimal
algorithm working on its densest job. We consider several different cases based on the values of
$q_o$, $q_a$, $w_{o,i}$, and $w_{a,i}$ and establish bounds on $d\Phi_i(t)/dt$. Recall the definition of
$\Phi_i(t)$ from equation (3):

$$\Phi_i(t) = \frac{2}{\epsilon} \int_{q=0}^{\infty} \int_{x=0}^{(w_{a,i}(q) - w_{o,i}(q))_+} \frac{x}{Q_i(x)} \, dx \, dq$$

Case (1): $w_{a,i} < w_{o,i}$. The only possible increase in the potential function occurs due to the
decrease in $w_{o,i}(q)$, which happens for values of $q \in [0, q_o]$. But for $q$'s in this range,
$w_{a,i}(q) \le w_{a,i}$ and $w_{o,i}(q) = w_{o,i}$. Thus the inner integral is empty, resulting in no
increase in potential. The running condition then holds since $w_{a,i} < w_{o,i}$.

Case (2): $w_{a,i} > w_{o,i}$. To quantify the change in potential due to the online algorithm working,
observe that for any $q \in [0, q_a]$, the inner integral of $\Phi_i$ decreases by

$$\int_{x=0}^{w_i(q)} \frac{x}{Q_i(x)} \, dx - \int_{x=0}^{w_i(q) - (1+\epsilon) s_a dt / q_a} \frac{x}{Q_i(x)} \, dx
  = \frac{w_i(q)}{Q_i(w_i(q))} (1+\epsilon) \frac{s_a \, dt}{q_a}$$

Here, we have used the fact that $dt$ is infinitesimally small to get the above equality. Hence, the
total drop in $\Phi_i$ due to the online algorithm's processing is

$$\frac{2}{\epsilon} \int_{q=0}^{q_a} \frac{w_i(q)}{Q_i(w_i(q))} (1+\epsilon) \frac{s_a \, dt}{q_a} \, dq
  \ge \frac{2}{\epsilon} \int_{q=0}^{q_a} \frac{w_i}{Q_i(w_i)} (1+\epsilon) \frac{s_a \, dt}{q_a} \, dq
  = \frac{2}{\epsilon} \frac{w_i}{Q_i(w_i)} (1+\epsilon) s_a \, dt$$

Here, the first inequality holds because $x/Q_i(x)$ is a non-decreasing function, and for all
$q \in [0, q_a]$, we have $w_{a,i}(q) = w_{a,i}$ and $w_{o,i}(q) \le w_{o,i}$ and hence $w_i(q) \ge w_i$.

Now to quantify the increase in the potential due to the optimal algorithm working: observe
that for $q \in [0, q_o]$, the inner integral of $\Phi_i$ increases by at most

$$\int_{x=w_i(q)}^{w_i(q) + s_o dt / q_o} \frac{x}{Q_i(x)} \, dx = \frac{w_i(q)}{Q_i(w_i(q))} \frac{s_o \, dt}{q_o}$$

Again notice that we have used the fact that here $dt$ is an infinitesimal period of time that in the
limit is zero. Hence the total increase in $\Phi_i$ due to the optimal algorithm's processing is at most

$$\frac{2}{\epsilon} \int_{q=0}^{q_o} \frac{w_i(q)}{Q_i(w_i(q))} \frac{s_o \, dt}{q_o} \, dq
  \le \frac{2}{\epsilon} \int_{q=0}^{q_o} \frac{w_i}{Q_i(w_i)} \frac{s_o \, dt}{q_o} \, dq
  = \frac{2}{\epsilon} \frac{w_i}{Q_i(w_i)} s_o \, dt.$$

Again here, the first inequality holds because $x/Q_i(x)$ is a non-decreasing function, and for all
$q \in [0, q_o]$, we have $w_{a,i}(q) \le w_{a,i}$ and $w_{o,i}(q) = w_{o,i}$ and hence $w_i(q) \le w_i$.
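Putting the two together, the overall increase in $\Phi_i(t)$ can be bounded by

$$\frac{d\Phi_i(t)}{dt} \le \frac{2}{\epsilon} \frac{w_i}{Q_i(w_i)} \bigl[ -(1+\epsilon) s_a + s_o \bigr]
  \le -\frac{2}{\epsilon} \, \epsilon \, (w_{a,i} - w_{o,i}) = -2 (w_{a,i} - w_{o,i})$$

where the second inequality uses $s_o \le s_a$ and $Q_i(w_i) \le Q_i(w_{a,i}) = s_a$. It is now easy to
verify that by plugging this bound on $d\Phi_i(t)/dt$ into the running condition one gets a valid
inequality.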

Case (3): $w_{a,i} = w_{o,i}$. In this case, let us just consider the increase due to OPT working. The
inner integral in the potential function starts off from zero (since $w_{a,i} - w_{o,i} = 0$) and
potentially (in the worst case) could increase to

$$\int_{x=0}^{s_o dt / q_o} \frac{x}{Q_i(x)} \, dx$$

(since $w_{o,i}$ drops by $s_o \, dt/q_o$ and $w_{a,i}$ cannot increase). However, since $x/Q_i(x)$ is a
monotone non-decreasing function, this is at most

$$\int_{x=0}^{s_o dt / q_o} \frac{w_{o,i}}{Q_i(w_{o,i})} \, dx = \frac{s_o \, dt}{q_o} \frac{w_{o,i}}{Q_i(w_{o,i})}$$

Therefore, the total increase in the potential $\Phi_i(t)$ can be bounded by

$$\frac{2}{\epsilon} \int_{q=0}^{q_o} \frac{w_{o,i}}{Q_i(w_{o,i})} \frac{s_o \, dt}{q_o} \, dq
  = \frac{2}{\epsilon} s_o \, dt \, \frac{w_{o,i}}{Q_i(w_{o,i})} = \frac{2}{\epsilon} w_{o,i} \, dt$$

It is now easy to verify that by plugging this bound on $d\Phi_i(t)/dt$ into the running condition, and
using the fact that $w_{a,i} = w_{o,i}$, one gets a valid inequality. □
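
The three "easy to verify" steps can be tallied per processor as follows. This is a sketch under the accounting of Subsection 2.2, where processor $i$ contributes $2w_{a,i}$ to the online cost rate and $2w_{o,i}$ to OPT's cost rate; the constant $c = 1 + 1/\epsilon$ is the value that this tally supports.

```latex
\begin{align*}
\text{Case (1), } w_{a,i} < w_{o,i}: \quad
  & 2w_{a,i} + \frac{d\Phi_i}{dt} \le 2w_{a,i} < 2w_{o,i} \le c \cdot 2w_{o,i}, \\
\text{Case (2), } w_{a,i} > w_{o,i}: \quad
  & 2w_{a,i} + \frac{d\Phi_i}{dt} \le 2w_{a,i} - 2(w_{a,i} - w_{o,i}) = 2w_{o,i} \le c \cdot 2w_{o,i}, \\
\text{Case (3), } w_{a,i} = w_{o,i}: \quad
  & 2w_{a,i} + \frac{d\Phi_i}{dt} \le 2w_{o,i} + \frac{2}{\epsilon} w_{o,i}
    = \Bigl(1 + \frac{1}{\epsilon}\Bigr) \cdot 2w_{o,i} = c \cdot 2w_{o,i}.
\end{align*}
```

Summing over all processors $i$ then gives the running condition (1) for the global potential.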

3 Algorithm for Unweighted Flow 

In this section, we give an immediate-assignment scheduling policy and show that it is
$O(1/\epsilon)$-competitive against a non-migratory adversary for the objective of unweighted flow plus
energy, assuming the online algorithm has resource augmentation of $(1+\epsilon)$ in speed. Note that
this result has a better competitiveness than the result for weighted flow from Section 2, but holds
only for the unweighted case.

We begin by giving intuition behind our algorithm, which is again similar to that for the
weighted case. Let OPT be some optimal schedule. For the rest of the section, we assume that the
algorithm used on each single machine to minimize the sum of flow times plus energy is that of
Andrew et al. [2], which runs at speed $Q(n)$ (i.e., sets the power to $n$) when there are $n$
unfinished jobs, and processes jobs according to SRPT. Since we know that this ALW algorithm [2] is
2-competitive against the optimal schedule on a single processor, we will imagine that, once OPT has
assigned a job to some processor, it uses the ALW algorithm on each processor. Likewise, once our
assignment policy assigns a job to some processor, our algorithm also runs the ALW algorithm on each
processor. Therefore, just like the weighted case, the crux of our algorithm is in designing a good
assignment policy, and arguing that it is $O(1)$-competitive even though our algorithm and OPT may
schedule a new job on different processors with completely different power functions.



3.1 Algorithm 

Our algorithm works as follows: Each processor maintains a queue of jobs that have currently been
assigned to it. At some time instant $t$, for any processor $i$, let $n_{a,i}^t(q)$ denote the number of
jobs in processor $i$'s queue that have a remaining processing time of at least $q$. Let $n_{a,i}^t$
denote the total number of unfinished jobs in the queue. Also, let us define the shadow potential for
processor $i$ at this time $t$ as

$$\widehat{\Phi}_{a,i}(t) = \int_{q=0}^{\infty} \sum_{j=1}^{n_{a,i}^t(q)} \frac{j}{Q_i(j)} \, dq$$

Note that the shadow potential $\widehat{\Phi}_{a,i}(t)$ is the total future cost of the online algorithm
(up to a constant factor) assuming no jobs arrive after this time instant, and the online algorithm
runs the ALW algorithm on all processors (i.e., the job selection is SRPT, and the processor is run at
a speed of $Q_i(n_{a,i}^t)$). Now our algorithm is the following:

When a new job of size $p$ arrives, the assignment policy assigns it to a processor which would cause
the smallest increase in the "shadow potential"; i.e., a processor minimizing

$$\int_{q=0}^{p} \sum_{j=1}^{n_{a,i}(q)+1} \frac{j}{Q_i(j)} \, dq - \int_{q=0}^{p} \sum_{j=1}^{n_{a,i}(q)} \frac{j}{Q_i(j)} \, dq
  = \int_{q=0}^{p} \frac{n_{a,i}(q)+1}{Q_i(n_{a,i}(q)+1)} \, dq$$

The job selection on each processor is SRPT (Shortest Remaining Processing Time), and we set the
power of processor $i$ at time $t$ to $n_{a,i}^t$. Once the job is assigned to a processor, it is never
migrated.
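
The unweighted assignment rule is simple enough to evaluate exactly, because the number of queued jobs with remaining size at least $q$ is a step function of $q$. The following is a minimal sketch with hypothetical names (`unweighted_increase`, `assign_unweighted`); each processor is represented by the list of remaining sizes in its queue and by a callable for its inverse power function, which is an assumption of the example rather than an interface from the paper.

```python
def unweighted_increase(remaining, p, Q):
    """Increase in the shadow potential when a job of size p joins a queue.

    remaining: remaining sizes of the jobs already queued on the processor.
    Q: inverse power function of the processor (speed for a given power).
    """
    # n(q) changes only at the remaining sizes of queued jobs.
    cuts = sorted({0.0, p} | {r for r in remaining if r < p})
    total = 0.0
    for left, right in zip(cuts, cuts[1:]):
        mid = 0.5 * (left + right)
        n = sum(1 for r in remaining if r >= mid)
        total += (right - left) * (n + 1) / Q(n + 1)
    return total


def assign_unweighted(queues, Qs, p):
    """Index of the processor with the smallest increase."""
    costs = [unweighted_increase(r, p, Q) for r, Q in zip(queues, Qs)]
    return min(range(len(costs)), key=costs.__getitem__)


# Example with two processors obeying the cube-root rule, P(s) = s**3.
cube_root = lambda x: x ** (1.0 / 3.0)
choice = assign_unweighted([[1.0], []], [cube_root, cube_root], 2.0)
```

In the example the second processor is idle, so the new job causes a smaller increase there and `choice` is its index.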

3.2 The Amortized Local-Competitive Analysis 

We again employ a potential function based analysis, similar to the one in Section 2.

3.2.1 The Potential Function.

We now describe our potential function $\Phi$. For time $t$ and processor $i$, recall the definitions
$n_{a,i}^t$ and $n_{a,i}^t(q)$ given above; analogously define $n_{o,i}^t$ as the number of unfinished
jobs assigned to processor $i$ by the optimal solution at time $t$, and $n_{o,i}^t(q)$ to be the number
of these jobs with remaining processing time at least $q$. Henceforth, we will drop the superscript
$t$ from these terms whenever the time instant $t$ is clear from the context.

Now, we define the global potential function to be $\Phi(t) = \sum_i \Phi_i(t)$, where $\Phi_i(t)$ is the
potential for processor $i$ defined as:

$$\Phi_i(t) = \frac{4}{\epsilon} \int_{q=0}^{\infty} \sum_{j=1}^{(n_{a,i}(q) - n_{o,i}(q))_+} \frac{j}{Q_i(j)} \, dq \tag{5}$$

Recall that $(x)_+ = \max(x, 0)$, and $Q_i = P_i^{-1}$. Notice that if the optimal solution has no jobs
remaining on processor $i$ at time $t$, we get that $\Phi_i(t)$ is (within a constant factor) simply
the shadow potential $\widehat{\Phi}_{a,i}(t)$.
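
To make the per-processor potential concrete, it can be computed directly from the two queues. The sketch below assumes the potential is $4/\epsilon$ times the integral over $q$ of the sum of $j/Q_i(j)$ for $j$ from 1 to $(n_{a,i}(q) - n_{o,i}(q))_+$; the function name `unweighted_potential` and the list-based queue representation are illustrative only.

```python
def unweighted_potential(alg_remaining, opt_remaining, Q, eps):
    """Potential of one processor, given the remaining job sizes queued by
    the online algorithm and by OPT, the inverse power function Q, and eps."""
    # Both counting functions are constant between consecutive breakpoints.
    cuts = sorted({0.0} | set(alg_remaining) | set(opt_remaining))
    total = 0.0
    for left, right in zip(cuts, cuts[1:]):
        mid = 0.5 * (left + right)
        n_a = sum(1 for r in alg_remaining if r >= mid)
        n_o = sum(1 for r in opt_remaining if r >= mid)
        excess = max(n_a - n_o, 0)
        total += (right - left) * sum(j / Q(j) for j in range(1, excess + 1))
    return 4.0 / eps * total
```

When OPT's queue is empty the excess equals the online count at every $q$, so the value is $4/\epsilon$ times the shadow potential; when OPT holds at least as many jobs as the online algorithm at every threshold $q$, the potential is zero.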

3.2.2 Proving the Arrival Condition. 

We now show that the increase in the potential $\Phi$ is bounded (up to a constant factor) by the
increase in the future optimal cost when a new job arrives. Suppose a new job of size $p$ arrives at
time $t$, and suppose the online algorithm assigns it to processor 1 while the optimal solution assigns
it to processor 2. Then $\Phi_1$ increases since $n_{a,1}(q)$ goes up by 1 for all $q \in [0, p]$,
$\Phi_2$ could decrease due to $n_{o,2}(q)$ going up by 1 for all $q \in [0, p]$, and $\Phi_i$ (for
$i \notin \{1, 2\}$) does not change.

Let us first assume that $n_{a,i}(q) \ge n_{o,i}(q)$ for all $q \in [0, p]$ and for $i \in \{1, 2\}$; we
will show below how to remove this assumption. Under this assumption, the total change in potential
$\Phi$ is

$$\frac{4}{\epsilon} \int_{q=0}^{p} \left( \frac{n_{a,1}(q) - n_{o,1}(q) + 1}{Q_1(n_{a,1}(q) - n_{o,1}(q) + 1)}
  - \frac{n_{a,2}(q) - n_{o,2}(q)}{Q_2(n_{a,2}(q) - n_{o,2}(q))} \right) dq$$

But since $x/Q(x)$ is increasing this is less than

$$\frac{4}{\epsilon} \int_{q=0}^{p} \left( \frac{n_{a,1}(q) + 1}{Q_1(n_{a,1}(q) + 1)}
  - \frac{n_{a,2}(q) - n_{o,2}(q)}{Q_2(n_{a,2}(q) - n_{o,2}(q))} \right) dq \tag{6}$$

By the greedy choice of processor 1 (instead of 2), this is less than

$$\frac{4}{\epsilon} \int_{q=0}^{p} \left( \frac{n_{a,2}(q) + 1}{Q_2(n_{a,2}(q) + 1)}
  - \frac{n_{a,2}(q) - n_{o,2}(q)}{Q_2(n_{a,2}(q) - n_{o,2}(q))} \right) dq$$

Now, since $x/Q(x)$ is subadditive this is less than

$$\frac{4}{\epsilon} \int_{q=0}^{p} \frac{n_{o,2}(q) + 1}{Q_2(n_{o,2}(q) + 1)} \, dq,$$

which in turn is (within a factor of $4/\epsilon$) precisely the increase in the future cost incurred by
the optimal solution, since we had assumed that OPT also runs the ALW algorithm on its processors.

Now suppose $n_{a,1}(q) < n_{o,1}(q)$ for some $q \in [0, p]$. There is no increase in the inner sum of
$\Phi_1$ for such values of $q$, and hence we can trivially upper bound this zero increase by
$\frac{n_{a,2}(q)+1}{Q_2(n_{a,2}(q)+1)}$. To discharge the assumption for processor 2, note that if
$n_{a,2}(q) < n_{o,2}(q)$ for some $q$, there is no decrease in the inner sum of $\Phi_2$ for this value
of $q$, but in this case we can simply use
$\frac{n_{a,2}(q)+1}{Q_2(n_{a,2}(q)+1)} \le \frac{n_{o,2}(q)+1}{Q_2(n_{o,2}(q)+1)}$ for such values of
$q$. Therefore, we get the same bound of

$$\frac{4}{\epsilon} \int_{q=0}^{p} \frac{n_{o,2}(q) + 1}{Q_2(n_{o,2}(q) + 1)} \, dq$$

on the increase in all cases, thus proving the following lemma.

Lemma 3.1 The arrival condition holds for the unweighted case with $d = \frac{4}{\epsilon}$.

3.2.3 Proving the Running Condition. 

In this section, our goal is to analyze the change in $\Phi$ in an infinitesimally small time interval
$[t, t + dt)$ and compare $d\Phi/dt$ to $d\mathrm{Alg}/dt$ and $d\mathrm{OPT}/dt$. We do this on a
per-processor basis; let us focus on processor $i$ at time $t$. Recall (after dropping the $t$
superscripts) the definitions of $n_{a,i}(q)$, $n_{o,i}(q)$, $n_{a,i}$ and $n_{o,i}$ from above, and
define $n_i(q) := (n_{a,i}(q) - n_{o,i}(q))_+$. Finally, let $q_a$ and $q_o$ denote the remaining sizes
of the jobs being worked on by the algorithm and optimal solution respectively at time $t$; recall
that both of them use SRPT for job selection. Define $s_a = Q_i(n_{a,i})$ and $s_o = Q_i(n_{o,i})$ to be
the (unaugmented) speeds of processor $i$ according to the online algorithm and the optimal algorithm
respectively; however, since we assume resource augmentation, our processor runs at speed
$(1+\epsilon)Q_i(n_{a,i})$. Hence $n_{a,i}(q)$ drops by 1 for $q \in (q_a - (1+\epsilon)s_a \, dt, q_a]$
for the online algorithm, and $n_{o,i}(q)$ drops by 1 for $q \in (q_o - s_o \, dt, q_o]$ for the optimal
algorithm. Let $I_a := (q_a - (1+\epsilon)s_a \, dt, q_a]$ and $I_o := (q_o - s_o \, dt, q_o]$ denote
these two intervals. Let us consider some cases:

Case (1): $n_{a,i} < n_{o,i}$. The increase in potential function may occur due to $n_{o,i}(q)$ dropping by 1 in $q \in I_o$. However, since $n_{a,i} < n_{o,i}$, it follows that $n_{a,i}(q) \le n_{a,i} < n_{o,i} = n_{o,i}(q)$ for all $q \in I_o$; the equality follows from OPT using SRPT. Consequently, even with $n_{o,i}(q)$ dropping by 1, $n_{a,i}(q) - n_{o,i}(q) \le 0$ and there is no increase in potential, or equivalently $d\Phi_i(t)/dt \le 0$. Hence, in this case, $4n_{a,i} + d\Phi_i(t)/dt \le 4n_{o,i}$.

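Case (2a): $n_{a,i} \ge n_{o,i}$, and $q_a < q_o$. For $q \in I_a$, the inner summation of $\Phi_i$ drops by $\frac{n_{a,i}(q) - n_{o,i}(q)}{Q_i(n_{a,i}(q) - n_{o,i}(q))}$, since $n_{a,i}(q)$ decreases by 1. Moreover, $n_{a,i}(q) = n_{a,i}$ and $n_{o,i}(q) = n_{o,i}$, because both Alg and OPT run SRPT, and we are considering $q \le q_a < q_o$. For $q \in I_o$, the inner summation of $\Phi_i$ increases by $\frac{n_{a,i}(q) - (n_{o,i}(q) - 1)}{Q_i(n_{a,i}(q) - (n_{o,i}(q) - 1))}$; however, we have $n_{a,i}(q) \le n_{a,i} - 1$ because $q_a < q_o$, and $n_{o,i}(q) = n_{o,i}$ because OPT runs SRPT. Therefore the increase is at most $\frac{n_{a,i} - n_{o,i}}{Q_i(n_{a,i} - n_{o,i})}$ for any $q \in I_o$. Combining these two, we get

\begin{align*}
\frac{d\Phi_i(t)}{dt} &\le \frac{4}{\epsilon} \cdot \frac{n_{a,i} - n_{o,i}}{Q_i(n_{a,i} - n_{o,i})} \left[ -(1+\epsilon)s_a + s_o \right] \\
&= \frac{4}{\epsilon} \cdot \frac{n_{a,i} - n_{o,i}}{Q_i(n_{a,i} - n_{o,i})} \left[ -(1+\epsilon)Q_i(n_{a,i}) + Q_i(n_{o,i}) \right] \\
&\le -\frac{4}{\epsilon}\,(n_{a,i} - n_{o,i})\, \frac{\epsilon\, Q_i(n_{a,i})}{Q_i(n_{a,i} - n_{o,i})} \le -4(n_{a,i} - n_{o,i}),
\end{align*}

where we repeatedly use that $Q_i(\cdot)$ is a non-decreasing function. This implies that $4n_{a,i} + d\Phi_i(t)/dt \le 4n_{o,i}$.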

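Case (2b): $n_{a,i} \ge n_{o,i}$, and $q_a > q_o$. In this case, for $q \in I_o$, the inner summation of $\Phi_i$ increases by $\frac{n_{a,i}(q) - (n_{o,i}(q) - 1)}{Q_i(n_{a,i}(q) - (n_{o,i}(q) - 1))}$. Also, we have $n_{a,i}(q) = n_{a,i}$ and $n_{o,i}(q) = n_{o,i}$, because $q \le q_o < q_a$ and both algorithms run SRPT. Therefore the overall increase in potential function can be bounded by $\frac{4}{\epsilon} \cdot \frac{n_{a,i} - n_{o,i} + 1}{Q_i(n_{a,i} - n_{o,i} + 1)}\, s_o\, dt$. Moreover, for $q \in I_a$, the inner summation of $\Phi_i$ drops by $\frac{n_{a,i}(q) - n_{o,i}(q)}{Q_i(n_{a,i}(q) - n_{o,i}(q))}$. Also, $n_{a,i}(q) = n_{a,i}$ and $n_{o,i}(q) \le n_{o,i} - 1$, because $q_o < q_a$ and there was a job of remaining size $q_o$ among the optimal solution's active jobs. Thus the potential function drops by at least $\frac{4}{\epsilon} \cdot \frac{n_{a,i} - n_{o,i} + 1}{Q_i(n_{a,i} - n_{o,i} + 1)}\, (1+\epsilon)\, s_a\, dt$, since $x/Q_i(x)$ is a non-decreasing function. Combining these terms,

\begin{align*}
\frac{d\Phi_i(t)}{dt} &\le \frac{4}{\epsilon} \cdot \frac{n_{a,i} - n_{o,i} + 1}{Q_i(n_{a,i} - n_{o,i} + 1)} \left[ -(1+\epsilon)s_a + s_o \right] \\
&= \frac{4}{\epsilon} \cdot \frac{n_{a,i} - n_{o,i} + 1}{Q_i(n_{a,i} - n_{o,i} + 1)} \left[ -(1+\epsilon)Q_i(n_{a,i}) + Q_i(n_{o,i}) \right] \\
&\le -\frac{4}{\epsilon}\,\epsilon\,(n_{a,i} - n_{o,i} + 1) \le -4(n_{a,i} - n_{o,i}).
\end{align*}

In the above, we have used the fact that $n_{o,i} \ge 1$, and consequently, $Q_i(n_{a,i}) \ge Q_i(n_{a,i} - n_{o,i} + 1)$. Therefore, in this case too we get $4n_{a,i} + d\Phi_i(t)/dt \le 4n_{o,i}$.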

Case (2c): $n_{a,i} \ge n_{o,i}$, and $q_a = q_o$. Since $n_{a,i} \ge n_{o,i}$, and $Q_i$ is an increasing function, $s_a \ge s_o$ and thus $I_o \subseteq I_a$. For $q$ in the interval $I_o$, the term $n_{o,i}(q)$ drops by 1 and the term $n_{a,i}(q)$ drops by 1, and therefore there is no change in $n_{a,i}(q) - n_{o,i}(q)$. For $q \in I_a \setminus I_o$, the inner summation for $\Phi_i$ drops by $\frac{n_{a,i}(q) - n_{o,i}(q)}{Q_i(n_{a,i}(q) - n_{o,i}(q))}$. Also, $n_{a,i}(q) = n_{a,i}$ and $n_{o,i}(q) = n_{o,i}$, and the decrease in potential function is $\frac{4}{\epsilon}\left((1+\epsilon)s_a\,dt - s_o\,dt\right) \frac{n_{a,i} - n_{o,i}}{Q_i(n_{a,i} - n_{o,i})}$. The analysis in Case (2a) implies that $4n_{a,i} + d\Phi_i(t)/dt \le 4n_{o,i}$ in this case as well.

Summing over all $i$, we get

Lemma 3.2 The running condition holds for the unweighted case with constant 4. At any time $t$ when there are no job arrivals, we have

\[
4\sum_i n_{a,i} + \frac{d\Phi(t)}{dt} \le 4\sum_i n_{o,i}.
\]
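
As a numerical sanity check of the per-processor inequality $4n_{a,i} + d\Phi_i(t)/dt \le 4n_{o,i}$, the sketch below evaluates the upper bounds on $d\Phi_i(t)/dt$ from Cases (2a) and (2b) for every pair of job counts up to a limit. The speed function $Q(x) = x^{1/3}$, the convention $0/Q(0) = 0$, and all function names are assumptions made for illustration; the check does not replace the proof.

```python
from typing import Callable

SpeedFn = Callable[[float], float]


def g(x: float, Q: SpeedFn) -> float:
    """x / Q(x), with the convention g(0) = 0."""
    return 0.0 if x == 0 else x / Q(x)


def potential_rate_2a(n_a: int, n_o: int, eps: float, Q: SpeedFn) -> float:
    """Upper bound on dPhi_i/dt from Case (2a)."""
    return (4.0 / eps) * g(n_a - n_o, Q) * (-(1.0 + eps) * Q(n_a) + Q(n_o))


def potential_rate_2b(n_a: int, n_o: int, eps: float, Q: SpeedFn) -> float:
    """Upper bound on dPhi_i/dt from Case (2b)."""
    return (4.0 / eps) * g(n_a - n_o + 1, Q) * (-(1.0 + eps) * Q(n_a) + Q(n_o))


def running_condition_holds(max_jobs: int, eps: float, Q: SpeedFn) -> bool:
    """Check 4 n_a + dPhi_i/dt <= 4 n_o for all n_o <= n_a <= max_jobs."""
    for n_o in range(0, max_jobs + 1):
        for n_a in range(n_o, max_jobs + 1):
            rates = [potential_rate_2a(n_a, n_o, eps, Q)]
            if n_o >= 1:  # Case (2b) relies on n_o >= 1
                rates.append(potential_rate_2b(n_a, n_o, eps, Q))
            if any(4 * n_a + r > 4 * n_o + 1e-9 for r in rates):
                return False
    return True


cube_root = lambda x: x ** (1.0 / 3.0)  # assumed concave speed function
ok = running_condition_holds(max_jobs=40, eps=0.1, Q=cube_root)
```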

Combining Lemmas 3.1 and 3.2 with the standard potential function argument indicated earlier, we get the following theorem.

Theorem 3.3 There is a $(1+\epsilon)$-speed $O(1/\epsilon)$-competitive immediate-assignment algorithm to minimize the total flow plus energy on heterogeneous processors with arbitrary power functions.

Acknowledgments: We thank Sangyeun Cho and Bruce Childers for helpful discussions about
heterogeneous multicore processors, and also Srivatsan Narayanan for several useful discussions.

A Estimating the Future Cost of BCP

Suppose we have a set of $n$ jobs, with weight $w_j$ and processing time $p_j$ for job $j$ ($1 \le j \le n$) such that $p_1/w_1 \le p_2/w_2 \le \ldots \le p_n/w_n$. Let $d_j = p_j/w_j$ denote the inverse density of a job $j$. We now explain how we get the estimate of the future cost of this configuration when we run the BCP algorithm, i.e. HDF at a speed of $Q(w^t)$, where $w^t$ is the total fractional weight of unfinished jobs at time $t$.

Firstly, observe that by virtue of our algorithm running HDF, it schedules job 1 followed by job 2, etc. Also, as long as the algorithm is running job 1, it runs the processor at speed $Q(W_{\ge 2} + w_1(t))$, where $W_{\ge 2} := w_2 + w_3 + \ldots + w_n$ and $w_1(t)$ is the fractional weight of job 1 remaining. Secondly, since our algorithm always uses power equal to the fractional weight remaining, the rate of increase of the objective function at any time $t$ is simply $2w^t$. Therefore, the following equations immediately follow:

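\begin{align*}
G_A(t) = \frac{dA}{dt} &= 2\left(W_{\ge 2} + w_1(t)\right) \\
\frac{dw_1(t)}{dt} &= -\frac{w_1}{p_1}\, Q\!\left(W_{\ge 2} + w_1(t)\right) \\
\frac{dA}{dw_1(t)} &= -2\,\frac{p_1}{w_1} \cdot \frac{W_{\ge 2} + w_1(t)}{Q\!\left(W_{\ge 2} + w_1(t)\right)}
\end{align*}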

That is, the total cost incurred while job 1 is being scheduled is

\[
2\, d_1 \int_{x=W_{\ge 2}}^{W_{\ge 2}+w_1} \frac{x}{Q(x)} \, dx \;=\; 2 \int_{q=0}^{d_1} \int_{x=W_{\ge 2}}^{W_{\ge 2}+w_1} \frac{x}{Q(x)} \, dx \, dq.
\]
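
The closed form for the cost of the first job can be compared against a direct time-stepped simulation of the two rates $dA/dt = 2(W_{\ge 2} + w_1(t))$ and $dw_1(t)/dt = -(w_1/p_1)\,Q(W_{\ge 2} + w_1(t))$. The sketch below is a rough forward-Euler check under the assumption $Q(x) = \sqrt{x}$, for which $x/Q(x)$ has antiderivative $\tfrac{2}{3}x^{3/2}$; the function names, the step size, and the instance are hypothetical.

```python
import math
from typing import Callable


def simulate_first_job(w1: float, p1: float, W_rest: float,
                       Q: Callable[[float], float], dt: float = 1e-4) -> float:
    """Forward-Euler estimate of the cost accrued while job 1 is running."""
    w = w1        # fractional weight of job 1 that is still unfinished
    cost = 0.0
    while w > 0.0:
        total = W_rest + w
        cost += 2.0 * total * dt          # dA/dt = 2 (W_{>=2} + w_1(t))
        w -= (w1 / p1) * Q(total) * dt    # dw_1/dt = -(w_1/p_1) Q(W_{>=2} + w_1(t))
    return cost


def closed_form_first_job(w1: float, p1: float, W_rest: float) -> float:
    """2 d_1 times the integral of x/Q(x) over [W_{>=2}, W_{>=2} + w_1], Q = sqrt."""
    G = lambda x: (2.0 / 3.0) * x ** 1.5  # antiderivative of x / sqrt(x)
    return 2.0 * (p1 / w1) * (G(W_rest + w1) - G(W_rest))


simulated = simulate_first_job(w1=2.0, p1=1.0, W_rest=1.0, Q=math.sqrt)
exact = closed_form_first_job(w1=2.0, p1=1.0, W_rest=1.0)
```

The two values agree up to the discretization error of the Euler scheme, which shrinks as `dt` is reduced.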

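Similarly, while any job $i$ is being scheduled, we can use the same arguments as above to show that the total cost (fractional flow plus energy) incurred is

\[
2\, d_i \int_{x=W_{\ge i+1}}^{W_{\ge i+1}+w_i} \frac{x}{Q(x)} \, dx \;=\; 2 \int_{q=0}^{d_i} \int_{x=W_{\ge i+1}}^{W_{\ge i+1}+w_i} \frac{x}{Q(x)} \, dx \, dq,
\]

with $W_{\ge n+1} := 0$. Summing over $i$, the total cost incurred by our algorithm is

\[
2 \sum_{i=1}^{n} \int_{q=0}^{d_i} \int_{x=W_{\ge i+1}}^{W_{\ge i+1}+w_i} \frac{x}{Q(x)} \, dx \, dq.
\]

Rearranging the terms, it is not hard to see (given $d_1 \le d_2 \le \cdots \le d_n$) that this is equal to

\[
2 \int_{q=0}^{\infty} \int_{x=0}^{w(q)} \frac{x}{Q(x)} \, dx \, dq,
\]

where $w(q)$ is the total weight of jobs with inverse density at least $q$.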
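
The rearrangement from the job-by-job sum to the double integral over $q$ can be checked exactly on a small instance. The sketch below assumes $Q(x) = \sqrt{x}$, so that $\int x/Q(x)\,dx = \tfrac{2}{3}x^{3/2}$ and both expressions have closed forms; the function names and the three jobs are hypothetical.

```python
from typing import List, Tuple

Job = Tuple[float, float]  # (processing time p_j, weight w_j)


def G(x: float) -> float:
    """Antiderivative of x / Q(x) for the assumed Q(x) = sqrt(x)."""
    return (2.0 / 3.0) * x ** 1.5


def cost_job_by_job(jobs: List[Job]) -> float:
    """Sum over jobs in HDF order of 2 d_i * int_{W_{>=i+1}}^{W_{>=i}} x/Q(x) dx."""
    ordered = sorted(jobs, key=lambda pw: pw[0] / pw[1])  # non-decreasing p/w
    suffix = sum(w for _, w in ordered)                   # W_{>=1}
    total = 0.0
    for p, w in ordered:
        rest = max(suffix - w, 0.0)                       # W_{>=i+1}
        total += 2.0 * (p / w) * (G(suffix) - G(rest))
        suffix = rest
    return total


def cost_double_integral(jobs: List[Job]) -> float:
    """2 * int_q int_0^{w(q)} x/Q(x) dx dq, with w(q) piecewise constant in q."""
    ordered = sorted(jobs, key=lambda pw: pw[0] / pw[1])
    suffix = sum(w for _, w in ordered)
    total, prev_d = 0.0, 0.0
    for p, w in ordered:
        d = p / w
        total += 2.0 * (d - prev_d) * G(suffix)  # w(q) = suffix on (prev_d, d]
        prev_d = d
        suffix = max(suffix - w, 0.0)
    return total


jobs = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]  # hypothetical (p_j, w_j) pairs
per_job = cost_job_by_job(jobs)
by_density = cost_double_integral(jobs)
```

The two totals coincide because the second expression is the first one summed by parts over the sorted inverse densities.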

B Subadditivity of x/Q(x) 

Let $Q(x) : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ be any concave function such that $Q(0) \ge 0$ and $Q'(x) \ge 0$ for all $x \ge 0$, and let $g(x) = x/Q(x)$. Then the following facts are true about $g(\cdot)$.

(a) $g(\cdot)$ is non-decreasing. That is, $g(y) \ge g(x)$ for all $y \ge x$.
(b) $g(\cdot)$ is subadditive. That is, $g(x) + g(y) \ge g(x + y)$ for all $x, y \in \mathbb{R}_{\ge 0}$.

To see why the first is true, consider $x$ and $y = \lambda x$ for some $\lambda \ge 1$. Then, showing (a) is equivalent to showing

\[
\frac{\lambda x}{Q(\lambda x)} \ge \frac{x}{Q(x)}.
\]

But this reduces to showing $Q(\lambda x) \le \lambda Q(x)$, which is true because $Q(\cdot)$ is a concave function with $Q(0) \ge 0$. To prove the second property, we first observe that the function $1/Q(x)$ is convex. This is because the second derivative of $1/Q(x)$ is

\[
\frac{-Q(x)^2\, Q''(x) + 2\, Q(x)\, Q'(x)^2}{Q(x)^4},
\]
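which is always non-negative for all $x \ge 0$, since $Q(x)$ is non-negative and $Q''(x)$ is non-positive for all $x \ge 0$. Therefore, since $1/Q(\cdot)$ is convex, it holds for any $x$, $y$, and $\alpha \ge 0$, $\beta \ge 0$ (not both zero) that

\[
\frac{\dfrac{\alpha}{Q(x)} + \dfrac{\beta}{Q(y)}}{\alpha + \beta} \ge \frac{1}{Q\!\left(\dfrac{\alpha x + \beta y}{\alpha + \beta}\right)}.
\]

Plugging in $\alpha = x$ and $\beta = y$, we get

\[
\frac{\dfrac{x}{Q(x)} + \dfrac{y}{Q(y)}}{x + y} \ge \frac{1}{Q\!\left(\dfrac{x^2 + y^2}{x + y}\right)},
\]

which implies

\[
\frac{x}{Q(x)} + \frac{y}{Q(y)} \ge \frac{x + y}{Q\!\left(\dfrac{x^2 + y^2}{x + y}\right)}.
\]

But since $Q(\cdot)$ is non-decreasing, we have $Q(x + y) \ge Q\!\left(\frac{x^2 + y^2}{x + y}\right)$ and hence

\[
\frac{x}{Q(x)} + \frac{y}{Q(y)} \ge \frac{x + y}{Q(x + y)}.
\]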
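
Both properties of $g$ are easy to spot-check numerically for a particular speed function. The sketch below uses $Q(x) = 1 + \ln(1 + x)$, which is an assumed example (concave, non-decreasing, with $Q(0) = 1$) and not a function taken from the paper; the grid and the helper names are likewise illustrative.

```python
import math
from typing import Sequence


def Q(x: float) -> float:
    """An assumed concave, non-decreasing speed function with Q(0) = 1."""
    return 1.0 + math.log1p(x)


def g(x: float) -> float:
    return x / Q(x)


def is_non_decreasing(points: Sequence[float]) -> bool:
    """Property (a) on consecutive points of an increasing grid."""
    return all(g(a) <= g(b) + 1e-12 for a, b in zip(points, points[1:]))


def is_subadditive(points: Sequence[float]) -> bool:
    """Property (b) on all pairs of grid points."""
    return all(g(x) + g(y) >= g(x + y) - 1e-12 for x in points for y in points)


grid = [0.25 * k for k in range(0, 81)]  # 0, 0.25, ..., 20
monotone = is_non_decreasing(grid)
subadditive = is_subadditive(grid)
```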

C Missing Proofs 

Proof of Lemma 2.1. We first show that

\[
\int_{x=w_a}^{w_a+w_j} g(x)\,dx - \int_{x=(w_a-w_o-w_j)_+}^{(w_a-w_o)_+} g(x)\,dx \le \int_{x=0}^{w_j} g(w_o+w_j)\,dx
\]

and then argue that $\int_{x=0}^{w_j} g(w_o+w_j)\,dx \le 2\int_{x=0}^{w_j} g(w_o+x)\,dx$ because $g(\cdot)$ is subadditive. To this end, we consider several cases and prove the lemma. Suppose $w_a$ is such that $w_a \ge w_o + w_j$: in this case we can discard the $(\cdot)_+$ operators on all the limits to get

\begin{align*}
\int_{x=w_a}^{w_a+w_j} g(x)\,dx - \int_{x=w_a-w_o-w_j}^{w_a-w_o} g(x)\,dx
&= \int_{x=0}^{w_j} \left[ g(w_a+x) - g(w_a-w_o-w_j+x) \right] dx \\
&\le \int_{x=0}^{w_j} g(w_o+w_j)\,dx.
\end{align*}

Here, the final inequality follows because $g(\cdot)$ is a subadditive function. On the other hand, suppose it is the case that $w_a \le w_o$. Then both limits $(w_a-w_o)_+$ and $(w_a-w_o-w_j)_+$ are zero, and therefore we only need to bound $\int_{x=w_a}^{w_a+w_j} g(x)\,dx$, which can be done as follows:

\[
\int_{x=w_a}^{w_a+w_j} g(x)\,dx = \int_{x=0}^{w_j} g(w_a+x)\,dx \le \int_{x=0}^{w_j} g(w_o+w_j)\,dx.
\]

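Finally, if $w_a = w_o + \delta$ for some $\delta \in (0, w_j)$, we first observe that $\int_{x=(w_a-w_o-w_j)_+}^{(w_a-w_o)_+} g(x)\,dx$ simplifies to $\int_{x=0}^{\delta} g(x)\,dx$. Therefore, we are interested in bounding

\begin{align*}
\int_{x=w_a}^{w_a+w_j} g(x)\,dx - \int_{x=0}^{\delta} g(x)\,dx
&= \int_{x=w_a}^{w_a+w_j-\delta} g(x)\,dx + \int_{x=w_a+w_j-\delta}^{w_a+w_j} g(x)\,dx - \int_{x=0}^{\delta} g(x)\,dx \\
&\le (w_j-\delta)\, g(w_a+w_j-\delta) + \int_{x=0}^{\delta} g(w_a+w_j-\delta)\,dx \\
&\le \int_{x=0}^{w_j} g(w_o+w_j)\,dx.
\end{align*}

Here again, in the second to last inequality, we used the fact that $g(\cdot)$ is subadditive and therefore $g(w_a+w_j-\delta+x) - g(x) \le g(w_a+w_j-\delta)$ for all values of $x \ge 0$; hence we get

\[
\int_{x=w_a+w_j-\delta}^{w_a+w_j} g(x)\,dx - \int_{x=0}^{\delta} g(x)\,dx \le \int_{x=0}^{\delta} g(w_a+w_j-\delta)\,dx.
\]

The first term is bounded using the fact that $g(\cdot)$ is non-decreasing.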

To complete the proof, we need to show that $\int_{x=0}^{w_j} g(w_o+w_j)\,dx \le 2\int_{x=0}^{w_j} g(w_o+x)\,dx$. To see this, consider the following sequence of steps:

For any $x \in [0, w_j]$, since $g$ is subadditive, we have

\[
g(w_o+w_j-x) + g(x) \ge g(w_o+w_j).
\]

Integrating both sides from $x = 0$ to $w_j$ we get

\[
\int_{x=0}^{w_j} g(w_o+w_j-x)\,dx + \int_{x=0}^{w_j} g(x)\,dx \ge \int_{x=0}^{w_j} g(w_o+w_j)\,dx,
\]

which is, by variable renaming, equivalent to

\[
\int_{y=0}^{w_j} g(w_o+y)\,dy + \int_{x=0}^{w_j} g(x)\,dx \ge \int_{x=0}^{w_j} g(w_o+w_j)\,dx.
\]

But since $g(\cdot)$ is non-decreasing, we have $\int_{x=0}^{w_j} g(x)\,dx \le \int_{x=0}^{w_j} g(w_o+x)\,dx$ and therefore

\[
\int_{y=0}^{w_j} g(w_o+y)\,dy + \int_{x=0}^{w_j} g(w_o+x)\,dx \ge \int_{x=0}^{w_j} g(w_o+w_j)\,dx,
\]

which is what we want. □
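
The two inequalities established in this proof can be spot-checked on a grid of weights. The sketch below assumes $Q(x) = \sqrt{x}$, so that $g(x) = \sqrt{x}$ and every integral has the closed form $\tfrac{2}{3}x^{3/2}$; the helper names and the grid are hypothetical, and a passing check on one function is only an illustration of the lemma.

```python
from typing import Sequence, Tuple


def g(x: float) -> float:
    """g(x) = x / Q(x) for the assumed Q(x) = sqrt(x)."""
    return x ** 0.5


def G(x: float) -> float:
    """Antiderivative of g."""
    return (2.0 / 3.0) * x ** 1.5


def pos(x: float) -> float:
    """The (.)_+ operator."""
    return max(x, 0.0)


def lemma_sides(w_a: float, w_o: float, w_j: float) -> Tuple[float, float, float]:
    """Left-hand side, middle term, and right-hand side of the two inequalities."""
    lhs = (G(w_a + w_j) - G(w_a)) - (G(pos(w_a - w_o)) - G(pos(w_a - w_o - w_j)))
    mid = w_j * g(w_o + w_j)               # integral of the constant g(w_o + w_j)
    rhs = 2.0 * (G(w_o + w_j) - G(w_o))    # 2 * integral of g(w_o + x)
    return lhs, mid, rhs


def lemma_holds_on_grid(values: Sequence[float]) -> bool:
    for w_a in values:
        for w_o in values:
            for w_j in values:
                lhs, mid, rhs = lemma_sides(w_a, w_o, w_j)
                if lhs > mid + 1e-9 or mid > rhs + 1e-9:
                    return False
    return True


values = [0.5 * k for k in range(0, 21)]  # 0, 0.5, ..., 10
holds = lemma_holds_on_grid(values)
```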